Shallow Text Analysis and Machine Learning for Authorship Attribution
Abstract
Current advances in shallow parsing and machine learning allow us to use results from these fields in a methodology for Authorship Attribution. We report on experiments with a corpus of newspaper articles about national current affairs, written by different journalists from the Belgian newspaper De Standaard. Because the documents are similar in genre, register, and range of topics, token-based features (e.g., sentence length) and lexical features (e.g., vocabulary richness) can be kept roughly constant across the different authors. This allows us to focus on syntax-based features as possible predictors of an author’s style, as well as on those token-based features that are more predictive of author style than of topic or register. These style characteristics are not under the author’s conscious control and are therefore good clues for Authorship Attribution. Machine Learning methods (TiMBL and the WEKA software package) are used to select informative combinations of syntactic, token-based, and lexical features and to predict the authorship of unseen documents. The combination of these features can be considered an implicit profile that characterizes the style of an author.
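As a rough illustration of the token-based and lexical features named in the abstract, the sketch below computes average sentence length and a type-token ratio as a simple proxy for vocabulary richness. It is not the feature set or tooling used in the paper: the function name `style_features`, the regex-based sentence splitting, and the choice of type-token ratio are all assumptions made for this example.

```python
import re


def style_features(text: str) -> dict:
    """Compute two simple style features for one document (illustrative only)."""
    # Assumption: split sentences naively after ., ! or ? followed by whitespace.
    # A real pipeline would use a proper sentence tokenizer.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Lowercased word tokens; \w+ is Unicode-aware in Python 3.
    tokens = re.findall(r"\w+", text.lower())

    # Token-based feature: mean number of word tokens per sentence.
    avg_sentence_length = len(tokens) / len(sentences) if sentences else 0.0

    # Lexical feature: distinct word types divided by total tokens.
    type_token_ratio = len(set(tokens)) / len(tokens) if tokens else 0.0

    return {
        "avg_sentence_length": avg_sentence_length,
        "type_token_ratio": type_token_ratio,
    }


if __name__ == "__main__":
    print(style_features("The cat sat. The cat ran away!"))
```

Feature dictionaries like this one, computed per document, could then be passed to a classifier. The type-token ratio is sensitive to document length, so comparisons are only meaningful between texts of similar size.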
Similar resources
Linguistic correlates of style: authorship classification with deep linguistic analysis features
The identification of authorship falls into the category of style classification, an interesting sub-field of text categorization that deals with properties of the form of linguistic expression as opposed to the content of a text. Various feature sets and classification methods have been proposed in the literature, geared towards abstracting away from the content of a text, and focusing on its ...
Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media
Social media allows people to interact and express their thoughts or feelings about different subjects. However, some users may write offensive twits to others via social media, which is known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "twits" via one of the machine l...
Invited talk: Text Analysis and Machine Learning for Stylometrics and Stylogenetics
Automatic Text Categorization, learning to assign documents to specific categories (e.g. in topic assignment or spam filtering), has been an influential application in Natural Language Processing. These systems consist of two components: a first one that constructs representations of documents (mostly bags of words represented as binary or numeric vectors), and a second one that uses standard m...
Plagiarism and authorship analysis: introduction to the special issue
The Internet has facilitated both the dissemination of anonymous texts as well as easy ‘‘borrowing’’ of ideas and words of others. This has raised a number of important questions regarding authorship. Can we identify the anonymous author of a text by comparing the text with the writings of known authors? Can we determine if a text, or parts of it, has been plagiarized? Such questions are clearl...